The goal of this seminar is to play around with diffrent techniques for data visualization. We are going work on the well-known MNIST. The dataset consists of 60,000 grayscale images of hand-written digits of size 28 $ \times $ 28. Thus, the dimensionality of the input space is 784.
In [ ]:
%matplotlib inline
from time import time
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.offsetbox import AnnotationBbox, OffsetImage
As usual we provide the code that fetches the data:
In [ ]:
import os
from sklearn.datasets import fetch_mldata
# Fetch MNIST dataset and create a local copy.
if os.path.exists('mnist.npz'):
with np.load('mnist.npz', 'r') as data:
X = data['X']
y = data['y']
else:
mnist = fetch_mldata("MNIST original")
X, y = mnist.data / 255.0, mnist.target
np.savez('mnist.npz', X=X, y=y)
Now it's your turn to plot some random representatives from each of 10 (obviously) available classes:
In [ ]:
# Preferably arrange images as a 10x10 matrix.
The whole dataset is somewhat large so we restrict ourselves to the random subset of 5,000 images (corresponding indices are held in train_indices
):
In [ ]:
n_train_samples = 5000
indices = np.arange(X.shape[0])
np.random.shuffle(indices)
train_indices = indices[: n_train_samples]
Your task is to try three different dimensionality reduction techniques which may be helpful for the visualization of high-dimensional data. The first one should be familiar to you: it's PCA with 2 components. The other two are non-linear embedding approaches described in the lecture, namely, MDS and LLE. Plot all three embeddings and check if they are good enough for the dataset.
NOTE: MDS (default settings) may take some time (>10 min) to compute, so be patient. It's probably a good idea to leave the exploration of this method as a hometask.
In [ ]:
from sklearn.decomposition import TruncatedSVD
from sklearn.manifold import MDS, LocallyLinearEmbedding
# Your code goes here.
Now we are going to explore the manifold of 2's using Isomap. First, you need to compute the embedding of the corresponding subset of MNIST.
In [ ]:
from sklearn.manifold import Isomap
indices_of_2 = np.arange(X.shape[0])[y == 2]
np.random.shuffle(indices_of_2)
train_indices_of_2 = indices_of_2[: n_train_samples]
# Your code goes here.
After it is done we can track how an appearance of the digit changes along the line. One can take two most distant points as the endpoints of the interpolation segment. The following code should extract closest points to the line. Use scipy.spatial.KDTree
for the fast nearest neighbour computation.
In [ ]:
from scipy.spatial import KDTree
from scipy.spatial.distance import pdist, cdist, squareform
from scipy import linspace
def find_representatives(kdtree, from_point, to_point, n_points):
# Given two 2D points this function should return a sequence of the dataset representatives (indices) that
# we encounter nearby as we go from from_point to to_point. This can be done by taking a set points on
# the segment and finding corresponding nearest neighbours in the dataset.
# Your code goes here.
return representatives
n_points = 100
# Your code starts here and shoudld end with:
# representatives = find_representatives(kdtree, X_embedded[from_idx, :], X_embedded[to_idx, :], n_points)
Now we define a bunch of helper functions for interpolation visualization. Note the diplay_manifold_flythrough
function. First two arguments are array of images and their coordinates on the 2D plane.
In [2]:
from IPython.display import HTML
from matplotlib import animation
VIDEO_TAG = """<video controls>
<source src="data:video/x-m4v;base64,{0}" type="video/mp4">
Your browser does not support the video tag.
</video>"""
def anim_to_html(anim):
if not hasattr(anim, '_encoded_video'):
with open('temp.mp4', 'wb') as f:
anim.save(f.name, fps=20, extra_args=['-vcodec', 'libx264'], writer='ffmpeg')
video = open('temp.mp4', "rb").read()
anim._encoded_video = video.encode("base64")
return VIDEO_TAG.format(anim._encoded_video)
def display_animation(anim):
plt.close(anim._fig)
return HTML(anim_to_html(anim))
def diplay_manifold_flythrough(X, coords, fig, ax):
imagebox = OffsetImage(X[0].reshape(28, 28), cmap=plt.cm.gray_r)
annbox = AnnotationBbox(imagebox, coords[0])
ax.add_artist(annbox)
def init():
return imagebox, annbox
def animate(i):
imagebox.set_data(X[i].reshape(28, 28))
annbox.xyann = coords[i]
return imagebox, annbox
anim = animation.FuncAnimation(fig, animate, init_func=init, frames=X.shape[0], interval=20, blit=True)
return display_animation(anim)
Now use diplay_manifold_flythrough
to display the manifold fly-through over the scatter plot of the subset:
In [ ]:
fig = plt.figure(figsize=(10, 10))
ax = plt.axes(frameon=False)
plt.setp(ax, xticks=(), yticks=())
# Your code starts here and should end with:
# diplay_manifold_flythrough(..., ..., fig, ax)
Additionally create an animation of the interpolation. Specifically, you should obtain an animation similar to the one presented below:
In [4]:
HTML("""<video controls>
<source src='./interpolation.mp4' type='video/mp4'>
Your browser does not support the video tag.
</video>""")
Out[4]:
In [ ]:
def display_interpolation(X, steps, fig):
# NOTE: First argument corresponds to the sequence of images.
n_images = X.shape[0]
im = plt.imshow(X[0].reshape(28, 28), cmap=plt.cm.gray_r)
def init():
return im,
def animate(i):
# Your code goes here.
im.set_array(img)
return im,
anim = animation.FuncAnimation(fig, animate, init_func=init, frames=steps, interval=20, blit=True)
return display_animation(anim)
fig = plt.figure(figsize=(3, 3))
ax = plt.axes(frameon=False)
plt.setp(ax, xticks=(), yticks=())
# Your code starts here and should end with:
# display_interpolation(.., 500, fig)
If we need to visualize the whole set of digits, a much better embedding technique is t-SNE. This method has been proven to generate high-quality visualizations in various scenarios and has become a standard de-facto in analyzing high-dimensional data (like activities of neurons in an Artificial Neural Network). We are going to use a slower version of the algorithm supplied in the scikit-learn
(sklearn.manifold.TSNE
). In practice, however, I'd suggest to use an approximate Barnes-Hut t-SNE which produces comparable results while working substantially faster.
Your task here is similar to the ones in the previous section. The differeces are:
train_indices
subset)matplotlib
(demo) in order to manually pick a sequence of keypoints on a scatter plot. Given a list of keypoints, it's easy to obtain intermediate representatives using the find_representatives
function. The rest of the code should be almost identical to the Isomap-case. Don't forget about the interpolation animation :)NOTE 1: t-SNE is available in the latest version of scikit-learn
. There is a python
package called tsne
(pip install tsne
) which implements Barnes-Hut t-SNE. Feel free to use that one.
NOTE 2: It may be a good idea to apply PCA (with, for example, 50 components) to the data first.
In [ ]:
from sklearn.manifold import TSNE
# Your code goes here.